-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CharRange and CharIter #111
Conversation
Expanded documentation and tests are TODO but this should work
license = "MIT/Apache-2.0" | ||
keywords = ["text", "unicode", "iteration"] | ||
description = "UNIC - CharRange" | ||
categories = ["text-processing", "iteration"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't actually checked that this is valid yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no "iteration" category. Maybe we don't need more than one. Here's the list: https://crates.io/categories/
FYI, I've got "internationalization" (and "localization") added to the list, but it hasn't made it to the live version yet. When that happens, almost all UNIC components get both text-processing
and internationalization
.
Benchmark results from
Benchmark results from
... I've got a bit of work to do don't I ... |
I can not for the life of me figure out why I'm going to try re-reimplementing this but closer to |
I figured out where I was losing time. In my experiments I think I found a cleaner for our uses design though... and it involves fewer moving parts (I think) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on getting the optimization right! 💯
Let me sleep on the type name and macro syntax. I don't want us to diverge from rustc concepts too much, so it's easier for new users to adapt, as well as putting things back into rustc. Specially, about the macro, I think we better support the common syntax, at least.
The code looks very good. A bunch of nits and a question inline.
repository = "https://github.com/behnam/rust-unic/" | ||
license = "MIT/Apache-2.0" | ||
keywords = ["text", "unicode", "iteration"] | ||
description = "UNIC - CharRange" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be better to use more informative language in description, as it's the only text that shows up on crates.io lists and item page, and allows better matching in search.
To follow the pattern from other UNIC components, it can be something like this:
description = "UNIC - Unicode Characters - Character Range and Iteration"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
return None; | ||
} | ||
|
||
let char = unsafe { char::from_u32_unchecked(self.low) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know it's possible to call a variable char
! That's cool! But also uncommon practice, but I don't know why! Anyways, may be safer to just call it ch
or chr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 it's a bad habit. It's not something you'd really expect to work, but it does (I think because Rust does type and variable name lookup differently). Surprisingly though, my IDE highlights it right, which is part of why I do it even though I probably shouldn't.
} | ||
let naive_len = self.high as usize - self.low as usize; | ||
if self.low <= SURROGATE_RANGE.start && SURROGATE_RANGE.end <= self.high { | ||
naive_len - SURROGATE_RANGE.len() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can optimize here by putting this number in a const
variable, but could be unnecessary. Not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately doing so and still calling len()
would require const fn
to be stable. I can see if precomputing it manually saves any non-negligible amount of time though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out it takes 2 ns (+/- 0 ns)
either way.
|
||
// ensure `high` is never one greater than a surrogate code point | ||
if self.high == SURROGATE_RANGE.end { | ||
self.high = SURROGATE_RANGE.start; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation with the new invariants is even more clean and slim! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll like #112 then 😄
//! ``` | ||
//! | ||
#![forbid(bad_style, missing_debug_implementations, unconditional_recursion)] | ||
#![deny(missing_docs, unsafe_code, unused, future_incompatible)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why unused, future_incompatible
are in deny()
but not forbid()
? Doesn't look like we allow()
them anywhere in this crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d24057a for future_incompatible
. (It doesn't work on 1.17 for some reason.) unused
, just because I was allow
ing it temporarily while building the library up.
Basically, I picked deny
for things that could plausibly be temporarily allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I guess it's fine for non-major flags, and we should probably just rely on clippy (in a future version) to hint on all forbid
able deny
s.
|
||
/// Does this range include a character? | ||
#[inline] | ||
pub fn contains(&self, char: char) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same about the char
variable name.
@@ -0,0 +1,40 @@ | |||
use std::char; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have a module here called char
, I think it would be easier to read the code if we stick with std::char
instead of using it as char
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the problem is std::char
and the primitive type char
conflicting, that's an intended overlap. The char
module contains things like char::from_u32
and char::MAX
which should act like they are on the primitive char
.
If you're saying unic-char
, that's not referenced anywhere from my code. The only time there is a char
module there is from inside the main unic
crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought we have a char.rs
here next to step.rs
. You're right, there's no conflict and it's fine. #aftermidnightreview
/// | ||
/// If the given `char` is already `char::MAX`, it is returned unchanged. | ||
#[allow(unsafe_code)] | ||
pub fn forward(c: char) -> char { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the same variable name here would make is easier to read.
/// If the given `char` is already `char::MAX`, it is returned unchanged. | ||
#[allow(unsafe_code)] | ||
pub fn forward(c: char) -> char { | ||
if c == char::MAX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, these checks result in an open_range()
to silently create a CharRange
with first
> last
. For example, if we go with open_range('\u{0x20}', '\u{0x21}'), then
first == U+21and
last = U+20`. Does this violate any invariants, or we don't care?
(It doesn't look like causing any issues right now, but may be with expansions of the API.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same with ..
syntax: you can create a "backwards" range and it just silently gives you a finished Range
. We can put panic!
guards around the constructors for improper use, but the standard library here has set a precedent of making an empty range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! Yeah, as long as the behavior is intended and works similar to std lib, it's fine.
I think it worth adding a hint about it in the constructors, though.
#[test] | ||
fn iter_fused() { | ||
let mut iter = CharRange::all().iter(); | ||
let mut fused = all_chars().into_iter().fuse(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be gated behind the fused
feature. Surprised that it's passing the CI!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fn to fuse an iterator is stable. It's merely the trait to declare yourself as fused that is unstable.
Btw, also take a look at https://doc.rust-lang.org/std/collections/enum.Bound.html and its usage in https://doc.rust-lang.org/std/collections/range/trait.RangeArgument.html . |
Whoops, replaced it while you were reviewing... Much was shared though so it's not lost effort. |
As for the macro, the syntax was a curveball idea. I'm sticking with just the standard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! ``` | ||
//! | ||
#![forbid(bad_style, missing_debug_implementations, unconditional_recursion)] | ||
#![deny(missing_docs, unsafe_code, unused, future_incompatible)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I guess it's fine for non-major flags, and we should probably just rely on clippy (in a future version) to hint on all forbid
able deny
s.
@@ -0,0 +1,40 @@ | |||
use std::char; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought we have a char.rs
here next to step.rs
. You're right, there's no conflict and it's fine. #aftermidnightreview
/// If the given `char` is already `char::MAX`, it is returned unchanged. | ||
#[allow(unsafe_code)] | ||
pub fn forward(c: char) -> char { | ||
if c == char::MAX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! Yeah, as long as the behavior is intended and works similar to std lib, it's fine.
I think it worth adding a hint about it in the constructors, though.
112: [char/range] Add CharRange and CharIter r=behnam The first half of adressing #91. Closes #111, this manner of attack is better than it. This PR only has one type, `CharRange`. It is effectively `std::ops::RangeInclusive` translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for `'a'..='z'` syntax.
Expanded documentation and tests are TODO -- working on those presently.
We can land the patch to use the char range in a different PR.